The aim of this project….
This project used data from 1,500 residential property sales in Ames, Iowa between 2006 and 2012. The data set contains 82 explanatory variables spanning nominal, ordinal, discrete, and continuous attributes. Continuous variables describe the various area dimensions of the house and property, such as the size of the lot and garage. Discrete variables quantify characteristics of the house and property, like the number of kitchens, baths, bedrooms, and parking spots. Nominal variables generally describe materials and locations, such as the name of the neighborhood or the type of foundation. Ordinal variables typically rate the condition and quality of various house characteristics and utilities.
Prior to doing the exploratory data analysis, we hypothesize that the following variables will be the most predictive of home price: lot area, home type, year built, and overall quality. We think these will be the most predictive because we assume that if we were to be in the market for a home, these would be among the top criteria we would consider when deciding which home to purchase.
Furthermore, we hypothesize that a generalized additive model (GAM) will be the best model to use, because a GAM can combine the strengths of several other model types, including polynomials, cubic splines, and smoothing splines.
Sale Price graph
When it comes to lot area, this dataset has many outliers, as shown above. We found 127 outliers above the outlier threshold of 17,755 square feet. As these made visualization difficult, we temporarily removed them. After removing the outliers, lot area has a roughly normal distribution around the median of 9,436.5 square feet.
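The threshold of 17,755 square feet is consistent with the common 1.5 × IQR upper-fence rule for flagging outliers. A minimal sketch of that rule (the values below are illustrative toy data, not the Ames data):

```r
# Flag high lot-area outliers with the 1.5 * IQR upper fence (toy data)
lot_area <- c(9436, 8500, 11000, 25000, 7000, 50000, 10000)

q <- quantile(lot_area, c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])  # values above this are flagged

outliers <- lot_area[lot_area > upper_fence]
trimmed  <- lot_area[lot_area <= upper_fence]
```

With this toy vector the fence is 31,548, so only the 50,000 square-foot lot is flagged; on the real data the same rule would be applied to `training$lot_area`.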
From Figure 3, we see that 1-story homes built in 1946 or later make up the bulk of our dataset (1,079 homes), over one-third of the 2,930 total observations. Please note that the graphs are interactive; move your cursor over a graph to see more details. From Figure 4, we also observe that most homes were built within five years of 2005.
Summary Statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.511 4.000 5.000
We can observe from Figure 5 that there is large variation in sale price across different neighborhoods. Even within neighborhoods we see variation. Investigating some housing characteristics may give us insight into this within-neighborhood variation in price. We first examined overall quality (Figure 6) and, as expected, price increases as overall quality increases.
Examining year built (Figure 7), we observe that the newer a home is, the higher its price, on average. In addition to investigating the relationship of sale price with location, overall quality, and age of the house, we also examined the relationship between sale price and home type. We find that 2-story homes built in 1946 or later have the highest median home prices (Figure 8).
Figure 9 explores the relationship between kitchen quality and sale price: the higher the kitchen quality, the higher the median sale price. This increase, however, is non-linear (closer to quadratic). From Figure 10, we can see that, as expected, there is a gradual positive relationship between lot area and sale price.
Sale Price:
Missing data:
Modifying variable class: We decided to keep the selected quality variables continuous rather than converting them to factors. Converting to a factor would have led us to drop the “Very Poor” (“1”) level, as it has only around 4 observations. By keeping the variable continuous, we retain these observations and can better predict the prices of homes in this category.
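The trade-off can be seen by tabulating level counts. A hypothetical sketch with toy quality ratings (not the Ames data) showing why a sparse factor level is problematic:

```r
# Toy quality ratings on a 1-5 scale: the "Very Poor" level (1) is sparse
qual <- c(1, 3, 3, 4, 5, 4, 3, 2, 5, 4, 4, 5)

# As a factor, level "1" would get its own dummy coefficient,
# estimated here from a single observation -- an unstable estimate
table(factor(qual, levels = 1:5))

# Kept numeric, all observations share one slope, so the sparse
# level still contributes information to the fit
summary(as.numeric(qual))
```

The cost of the numeric coding is the assumption that the effect of quality on price is monotone in the level ordering, which the EDA above suggests is reasonable here.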
Model Selection:
## [1] 7
## [1] 7
# Best subset selection
regfit.subset <- regsubsets(saleprice ~ .,
                            data = training,
                            nbest = 1,
                            nvmax = 7,
                            method = "exhaustive", really.big = TRUE)
regfit.subset_sum <- summary(regfit.subset)
# regfit.subset_sum
# Plot of BIC for the subsets
# Sequence from 1 to the number of candidate models (one per subset size)
num_variables <- seq(1, length(regfit.subset_sum$bic))
plot_BIC <- ggplot(data = data.frame(regfit.subset_sum$bic),
                   aes(x = num_variables, y = regfit.subset_sum$bic)) +
  geom_line() +
  geom_point(x = which.min(regfit.subset_sum$bic),
             y = min(regfit.subset_sum$bic),
             color = "green") +
  xlab("# Variables") +
  ylab("BIC") +
  theme_bw()
plot_BIC
# BIC selected best subset model
which.min(regfit.subset_sum$bic)
## [1] 7
names(coef(regfit.fwd, 7))
## [1] "(Intercept)" "tot_rms_abv_grd" "overall_qual"
## [4] "lot_area" "Bsmt.Qual" "Kitchen.Qual"
## [7] "NeighborhoodNorthridge" "BsmtFin.Type.1Unf"
names(coef(regfit.bwd, 7))
## [1] "(Intercept)" "tot_rms_abv_grd" "overall_qual"
## [4] "lot_area" "Bsmt.Qual" "Kitchen.Qual"
## [7] "NeighborhoodNorthridge" "BsmtFin.Type.1Unf"
names(coef(regfit.subset, 7))
## [1] "(Intercept)" "tot_rms_abv_grd" "overall_qual"
## [4] "lot_area" "Bsmt.Qual" "Kitchen.Qual"
## [7] "NeighborhoodNorthridge" "BsmtFin.Type.1Unf"
Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.QualEx
# Trains a degree-d polynomial on the training data and returns its
# prediction error on the test data. dat is assumed to be a data frame
# with two columns: the first named x, the second named y.
# Output: the test MSE of the model.
polyTestErr <- function(dat, train, d) {
  poly.fit <- lm(y ~ poly(x, degree = d), data = dat, subset = train)
  preds <- predict(poly.fit, dat)[-train]
  mean((dat$y[-train] - preds)^2)
}

cubicSplineTestErr <- function(dat, train, df) {
  # bs() requires at least 3 degrees of freedom
  if (df >= 3) {
    spline.fit <- lm(y ~ bs(x, df = df), data = dat, subset = train)
    preds <- predict(spline.fit, dat)[-train]
    mean((dat$y[-train] - preds)^2)
  } else {
    NA
  }
}

smoothSplineTestErr <- function(dat, train, df) {
  # smooth.spline() requires more than 1 degree of freedom
  if (df > 1) {
    spline.fit <- with(dat, smooth.spline(x[train], y[train], df = df))
    preds <- predict(spline.fit, dat$x)$y[-train]
    mean((dat$y[-train] - preds)^2)
  } else {
    NA
  }
}
smoothCV <- function(x, y, K = 10, df.min = 1, df.max = 10) {
  dat <- data.frame(x = x, y = y)  # data frame built from the x and y input vectors
  n <- length(y)                   # number of observations
  num.methods <- 3                 # number of method types
  method.names <- c("poly", "cubic.spline", "smoothing.spline")
  err.out <- data.frame(df = rep(df.min:df.max, each = num.methods),
                        method = rep(method.names, df.max - df.min + 1))
  # Get a random permutation of the indexes
  random.perm <- sample(n)
  # Break points for the folds. If n is not evenly divisible by K,
  # the folds may not be of exactly the same size.
  fold.breaks <- round(seq(1, n + 1, length.out = K + 1))
  fold.start <- fold.breaks[1:K]
  fold.end <- fold.breaks[2:(K + 1)] - 1
  fold.end[K] <- n                        # Fix the last endpoint to equal n
  fold.size <- fold.end - fold.start + 1  # number of observations in each fold
  cv.err <- NULL
  fold.err <- matrix(0, nrow = K, ncol = num.methods)
  colnames(fold.err) <- method.names
  # Outer loop: iterate over degrees of freedom
  # Inner loop: iterate over the K folds
  for (df in df.min:df.max) {
    for (k in 1:K) {
      test.idx <- fold.start[k]:fold.end[k]
      train <- random.perm[-test.idx]
      # Calculate test error for the three models
      poly.err <- polyTestErr(dat, train = train, d = df)
      cubic.spline.err <- cubicSplineTestErr(dat, train = train, df = df)
      smooth.spline.err <- smoothSplineTestErr(dat, train = train, df = df)
      # Store results for this fold
      fold.err[k, ] <- c(poly.err, cubic.spline.err, smooth.spline.err)
    }
    # Weighted average to form the CV error estimate: the MSE estimate from
    # each fold is weighted by the size of the fold. If all folds are the
    # same size, this equals the unweighted average of the MSEs.
    err.ave <- colSums(sweep(fold.err, MARGIN = 1, fold.size, FUN = "*") / n)
    cv.err <- c(cv.err, err.ave)
  }
  err.out$cv.error <- cv.err
  err.out
}
plot.smoothCV <- function(smoothcv.err, K, title.text = "", facet = FALSE,
                          y.scale.factor = NULL) {
  # Convert the method names to display labels
  dat <- transform(smoothcv.err,
                   method = mapvalues(method,
                                      c("poly", "cubic.spline", "smoothing.spline"),
                                      c("Polynomial", "Cubic spline", "Smoothing Spline")))
  # Set axis labels
  x.text <- "Degrees of Freedom"
  y.text <- paste0(K, "-fold CV Error")
  # The ggplot "setting": data, axes, and color by method
  p <- ggplot(data = dat, aes(x = df, y = cv.error, colour = method))
  # Overlay line plots, data points, axis labels, and the graph title
  p <- p + geom_line() + geom_point() + xlab(x.text) + ylab(y.text) +
    ggtitle(title.text)
  # Adjust the y axis range if y.scale.factor is specified
  if (!is.null(y.scale.factor)) {
    min.err <- min(dat$cv.error, na.rm = TRUE)
    p <- p + ylim(min.err, y.scale.factor * min.err)
  }
  # Show a separate panel per method if facet = TRUE
  if (facet) {
    print(p + facet_wrap("method"))
  } else {
    print(p)
  }
}
#Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.QualEx
cv.lot_area <- smoothCV(x = training$lot_area,
y = training$saleprice,
df.min = 1,
df.max = 10)
plot.smoothCV(cv.lot_area,
K = 10,
title.text = "CV Error: saleprice ~ lot_area",
y.scale.factor = 1.5)
## Warning: Removed 9 row(s) containing missing values (geom_path).
## Warning: Removed 9 rows containing missing values (geom_point).
A degree 2 smoothing spline appears to be the best model choice for lot area: it has the lowest CV error and the most stable curve.
#Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.QualEx
cv.tot_rms_abv_grd <- smoothCV(x = training$tot_rms_abv_grd,
y = training$saleprice,
df.min = 1,
df.max = 10)
## Warning in predict.lm(spline.fit, dat): prediction from a rank-deficient fit may
## be misleading
plot.smoothCV(cv.tot_rms_abv_grd,
K = 10,
title.text = "CV Error: saleprice ~ tot_rms_abv_grd",
y.scale.factor = 1.4)
## Warning: Removed 6 row(s) containing missing values (geom_path).
## Warning: Removed 6 rows containing missing values (geom_point).
A degree 6 smoothing spline appears to be the best fit for the total rooms above grade variable. While a lower-degree cubic spline is comparable, the cubic spline becomes more unstable at higher degrees of freedom.
#Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.QualEx
cv.overall_qual <- smoothCV(x = training$overall_qual,
y = training$saleprice,
df.min = 1,
df.max = 8)
plot.smoothCV(cv.overall_qual,
K = 10,
title.text = "CV Error: saleprice ~ overall_qual",
y.scale.factor = 1.3)
## Warning: Removed 3 row(s) containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).
A degree 6 smoothing spline appears to be a good fit here; however, the other models perform comparably well.
#Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.QualEx
cv.Kitchen.Qual <- smoothCV(x = training$Kitchen.Qual,
y = training$saleprice,
df.min = 1,
df.max = 3)
plot.smoothCV(cv.Kitchen.Qual,
K = 10,
title.text = "CV Error: saleprice ~ Kitchen.Qual",
y.scale.factor = 1.1)
## Warning: Removed 3 row(s) containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).
A quadratic polynomial appears to be the best fit for this variable, as it has the lowest CV error.
#Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.QualEx
cv.year_built <- smoothCV(x = training$year_built,
y = training$saleprice,
df.min = 1,
df.max = 10)
plot.smoothCV(cv.year_built,
K = 10,
title.text = "CV Error: saleprice ~ year_built",
y.scale.factor = 1.2)
## Warning: Removed 3 row(s) containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).
A cubic spline with 8 degrees of freedom appears to be the best model in this case. Other models are close in CV error and are fairly stable, but the cubic spline model has the lowest error.
#Chosen Variables: lot_area, tot_rms_abv_grd, overall_qual, Kitchen.Qual, year_built, NeighborhoodNorthridge, Bsmt.Qual
cv.Bsmt.Qual <- smoothCV(x = training$Bsmt.Qual,
y = training$saleprice,
df.min = 1,
df.max = 3)
plot.smoothCV(cv.Bsmt.Qual,
K = 10,
title.text = "CV Error: saleprice ~ Bsmt.Qual",
y.scale.factor = 1.1)
## Warning: Removed 3 row(s) containing missing values (geom_path).
## Warning: Removed 3 rows containing missing values (geom_point).
A degree-three polynomial appears to be the best option, as it has the lowest CV error. The cubic spline curve has only one point, so it is unclear whether its trend is stable.
gam.fit <- gam(saleprice ~ s(lot_area, 2) +
s(tot_rms_abv_grd, 6) +
s(overall_qual, 6) +
poly(Kitchen.Qual, 2) +
bs(year_built, 8) +
poly(Bsmt.Qual, 3) +
Neighborhood + full_bath_abv_grd +
Roof.Style + BsmtFin.Type.1,
data = training)
# GAM Training Error
gam.fit.sum <- summary(gam.fit)
training.gam <- training[,c("tot_rms_abv_grd", "overall_qual", "lot_area", "year_built", "Bsmt.Qual", "Kitchen.Qual", "Neighborhood", "full_bath_abv_grd","Roof.Style", "BsmtFin.Type.1")]
gam.preds.train <- predict.Gam(gam.fit, training.gam)
mean((training$saleprice - gam.preds.train)^2)
## [1] 974957404
# GAM Test Error
testing.gam <- testing[,c("tot_rms_abv_grd", "overall_qual", "lot_area", "year_built", "Bsmt.Qual", "Kitchen.Qual", "Neighborhood", "full_bath_abv_grd","Roof.Style", "BsmtFin.Type.1")]
gam.preds.test <- predict.Gam(gam.fit, testing.gam)
## Warning in bs(year_built, degree = 3L, knots = c(`16.66667%` = 1948, `33.33333%`
## = 1964, : some 'x' values beyond boundary knots may cause ill-conditioned bases
mean((testing$saleprice - gam.preds.test)^2)
## [1] 982105866
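For a direct comparison with the linear model fit below, the GAM's test MSE can be converted to an RMSE on the sale-price scale (simple arithmetic on the value reported above):

```r
# Test MSE of the GAM, as reported above
gam.test.mse <- 982105866

# RMSE puts the error in the same dollar units as sale price,
# making it directly comparable to the linear model's RMSE
gam.rmse <- sqrt(gam.test.mse)
gam.rmse  # roughly 31,339
```

On this scale the GAM's test error (about $31,339) is somewhat lower than the linear model's RMSE of $32,828.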
lm.fit <- lm(saleprice ~ lot_area + Neighborhood +
               year_built + tot_rms_abv_grd +
               overall_qual + Bsmt.Qual +
               Kitchen.Qual + full_bath_abv_grd +
               Roof.Style + BsmtFin.Type.1, data = training)
lm.predict <- predict(lm.fit, newdata = testing)
rmse <- sqrt(mean((lm.predict - testing$saleprice)^2))
rmse
## [1] 32828.06
mae <- mean(abs(lm.predict - testing$saleprice))
mae
## [1] 23224.5
#https://towardsdatascience.com/metrics-to-understand-regression-models-in-plain-english-part-1-c902b2f4156f